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Abstract. A general approach to the design of accurate classical potentials for protein folding is 
described. It includes the introduction of a meaningful statistical measure of the differences between 
approximations of the same potential energy, the definition of a set of Systematic and Approximately 
Separable and Modular Internal Coordinates (SASMIC), much convenient for the simulation of 
general branched molecules, and the imposition of constraints on the most rapidly oscillating 
degrees of freedom. All these tools are used to study the effects of constraints in the Conformational 
Equilibrium Distribution (CED) of the model dipeptide HCO-L-Ala-NH2. We use ab initio Quantum 
Mechanics calculations including electron correlation at the MP2 level to describe the system, and 
we measure the conformational dependence of the correcting terms to the naive CED based in the 
Potential Energy Surface (PES) without any simplifying assumption. These terms are related to 
mass-metric tensors determinants and also occur in the Fixman's compensating potential. We show 
that some of the corrections are non-negligible if one is interested in the whole Ramachandran 
space. On the other hand, if only the energetically lower region, containing the principal secondary 
structure elements, is assumed to be relevant, then, all correcting terms may be neglected up to 
peptides of considerable length. This is the first time, as far as we know, that the analysis of the 
conformational dependence of these correcting terms is performed in a relevant biomolecule with a 
realistic potential energy function. 
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INTRODUCTION 

Proteins are long chains comprised of twenty different amino acidic monomers and 
they are central elements in the biological machinery of all known living beings. They 
perform most of the catalytic tasks that are vital in the many coupled chains of chemical 
reactions occurring in the cells, they are found as structural building blocks in the 
cytoskeleton or in organelles, such as the ribosome, and they also play a very important 
role as membrane receptors. Their absence or malfunctioning is related to many diseases 
such as Creutzfeldt- Jakob's or Cancer [[ll,0] and the proteins involved in the biology of 
pathogens are often the preferred target of newly designed drugs (see the talks by E. 
Freire and C. Cavasotto in this meeting). 

Despite their complexity and the many opposing forces that determine their behaviour, 
these molecules swiftly acquire a unique three-dimensional native structure in the phys- 



iological milieu. Some details of this process are still not clear (see the talk by J. M. 
Sanchez-Ruiz), such as the relative proportion of the naturally occurring proteins that 
fold CO- or post-translationally (i.e., during or after biosynthesis at the ribosome) [[H], 
or the role played by molecular chaperones such as GroEL (see M. Karplus' talk), the 
Prolyl-peptidyl-isomerase, or the Protein disulfide isomerase, among others. However, 
since the pioneering work of Anfinsen [4], it is known that a large number of water- 
soluble globular proteins are capable of reaching their native structure in vitro after be- 
ing unfolded by changes in their environment, such as a raise of the temperature, the 
addition of denaturing agents or a change in the pH. It is the prediction of the native 
structure in these cases (only from the amino acid sequence and the laws of physics) that 
has become paradigmatic and receives the name of protein folding problem. 

In 2005, in a special section of the Science magazine entitled "What don't we know?" 
a selection of the hundred most interesting yet unanswered scientific questions 
was presented. What indicates the importance of the protein folding problem is not the 
inclusion of the question Can we predict how proteins will fold?, which was a must, but 
the large number of other questions which were related to or even dependent on it, such 
as Why do humans have so few genes?, How much can human life span be extended?. 
What is the structure of water?. How does a single somatic cell become a whole plant?. 
How many proteins are there in humans ?, How do proteins find their partners ?, How 
do prion diseases work?. How will big pictures emerge from a sea of biological data?. 
How far can we push chemical self-assembly? or Is an effective HIV vaccine feasible? , 
to quote just a few of them. 

Some authors [@] divide the problem in two parts: the prediction of the three- 
dimensional, biologically functional, native state of a protein and the description of the 
actual folding process that takes the protein there from the unfolded state. The first part, 
which is more pressing and more technologically oriented, is included in the second part 
and it is, therefore, easier to tackle, as the relative success of knowledge-based meth- 
ods suggests 10, [sl]. However, we believe that, not only much theoretical insight may be 
gained from a solution of the more general second part of the problem, but also much en- 
gineering and design power, as well as new comprehension about so distinct topics as the 
ones quoted in the preceding paragraph. This is why our approach is one of bottom-top 
and ab initio flavor. 

POTENTIAL ENERGY FUNCTIONS 

The fundamental theory of matter that is nowadays accepted as correct by the scientific 
community is Quantum Mechanics. For the study of the conformational behaviour of a 
molecule consisting of n atoms, with atomic numbers Z« and masses Ma, OC = 1, . . . ,n, 
one typically assumes that relativistic effects are negligibleQ and that, according to 
the Bom-Oppenheimer scheme f^, the great differences in mass between electrons 
and nuclei allows to consider that the former are described by a Hamiltonian which 



Which, in organic molecules, is approximately correct for all the particles involved, except, maybe, for 
some core electrons in the heaviest atoms. 



is adiabatically decoupled from the nuclear one and that depends only parametrically 
on the positions of the nuclei. Hence, the behaviour of the system in vacuum may be 
extracted from the non-relativistic time-independent nuclear Schrodinger equation: 

where E^{R) denotes the effective potential due to the electronic cloud in the funda- 
mental energy stateH and R is shorthand for Ri, ... ,Rn. 

Despite the exponential growth in computing power that has been taking place in the 
last decades (see, for example, the talks by A. Perczel and 1. Campos), a precise descrip- 
tion of the behaviour of any biologically interesting system derived from the solution of 
(HI) remains far from being even imaginable. Not to mention the huge complications that 
arise when the unavoidable inclusion of solvent is considered. This is why, omitting a 
myriad of possible intermediate descriptions, the most popular choice for the in silico 
pred iction of the protein folding process has become the use of the ^o-cdXlQA force fields 
in which one assumes that the behaviour of the macromolecule (omit- 
ting again the solvent, to compare with ([T])) is classical and may be described via a very 
simple potential energy function which, typically, has the form 




where are bond lengths, 9a are bond angles, <^a are dihedral angle£| and R^p 
denotes the interatomic distances. Finally, all the parameters entering (O (which may 
amount to thousands) are customarily fitted to reproduce thermodynamical measure- 
ments or taken from quantum mechanical calculations. 

While it is true that these empirical potentials may be detailed enough to deal with 
simple conformational transitions in already folded proteins (see the talk by J. Luque) or 
with collective motions of systems of many proteins (see M. Karplus' talk), and that 
they may also be used as scoring functions for protein design (as in the talk by A. 
Jaramillo), all these applications require only that the energetics of the native structure 
and its surroundings be correctly described. As A. Tramontano told us in her talk, the 
usefulness of these simple potentials for de novo structural prediction (assessed via the 
CASP contes10) remains much limited. 



This additional assumption that the electrons are in the fundamental state prevents us from describing 
the catalytic behaviour of most enzymes, however, the only interest here is to describe the folding process. 
^ For the sake of simplicity, no harmonic terms have been assumed for out-of-plane angles or for hard 
dihedrals, such as the peptide bond angle CO 
^ See |http : //predictioncenter . org| 



We believe that one of the reasons of this failure is the lack of accuracy of the 
potential energy functions used, since, even if the parameters fit is properly carried out, 
the choice of the very particular dependencies, for example those in constitutes a 
heavy restriction in the space of functions. Accordingly, one of our aims is the design of 
classical potentials which are as similar as possible to the effective Bom-Oppenheimer 
one in ©. To do this, one must calculate the electronic energy E^{R) using the powerful 
tools of Quantum Chemistry (see the talks by A. Perczel, J. J. Dannenberg and M. 
Amzel) and devise numerically efficient approximations to it. 

In any case, in order to walk the long path connecting Quantum Mechanics and a 
classical description amenable to nowadays computers, one must have a meaningful 
way of comparing different approximations of the potential energy of a system. Much in 
the spirit of the talk by M. Wall, and using the fact that the complex nature of biological 
molecules suggests the convenience of statistical analyses, we have designed in [14] a 
distance, denoted by dn, between any two different potential energy functions, Vi and 
V2, that, from a working set of conformations, measures the typical error that one makes 
in the energy differences if V2 is used instead of the more accurate Vi , admitting a linear 
rescaling and a shift in the energy reference. 

This distance, which has energy units, presents better properties than other quantities 
customarily used to perform these comparisons, such as the energy RMSD, the average 
energy error, etc. It may be related to the Pearson's correlation coefficient by 

^I2 = v^a2(l-r?2)'/^ (3) 

Finally, due to its physical meaning, it has been argued in fl^ that, if the distance 
between two different approximations of the energy of the same system is less than RT, 
one may safely substitute one by the other without altering the relevant dynamical or 
thermodynamical behaviour. 



EFFECTS OF CONSTRAINTS 

Another reason underlying the difficulties faced in the computational study of the protein 
folding problem is that the large number of degrees of freedom brings up the necessity 



to sample an astronomically large conformational space [|15l] . In addition, the typical 
timescales of the different movements are in a wide range and, therefore, demandingly 
small timesteps must be used in Molecular Dynamics simulations in order to properly 
account for the fastest modes [16], which lie in the femtosecond range; whereas the 
folding of a large protein may take seconds. In order to deal with these problems, one 
may naturally consider the reduction of the number of degrees of freedom describing 
macromolecules via the imposition of constraints. 

To manage this situation, we have made progresses in two directions. First, we have 
devised [|l7] a set of internal coordinates called SASMIC (standing for Systematic and 
Approximately Separable and Modular Internal Coordinates), which are much conve- 
nient to describe branched molecules and, specially, polypeptides, without having to 
rewrite the whole Z-matrix upon addition of new residues to the chain, and also allow to 



maximally separate the soft and hard movement^. 

Second, we have used these coordinates, the distance discussed before and the factor- 
ization of the external variables in the mass-metric determinants that we describe in [llSll . 
to study the possibility of neglecting the conformational dependence of the correcting 
terms that appear in the equilibrium distribution of organic molecules [11911 . 

Constraining the hard coordinates q' to be specific functions f^{q') of the soft ones 
(which defines a hypersurface E in the whole conformational space) produces two clas- 
sical constrained models which are known to be conceptually [|20l . and practically 
[I22L I23I1 inequivalent: they are called stiff and rigid. In the classical rigid model, the 
constraints are assumed to be exact and all the velocities that are orthogonal to the hy- 
persurface defined by them vanish. In the classical stiff model, on the other hand, the 
constraints are assumed to be approximate and they are implemented by a steep poten- 
tial that drives the system to the constrained hypersurface. In this case, the orthogonal 
velocities are activated and may act as 'heat containers'. 

The conformational equilibrium of the system, according to these models, is described 
by the following probability densities [|l9ll : 



Classical Stiff Model 



Classical Rigid Model 
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where /3 := 1 /RT, Vj: is the potential energy in E (the Potential Energy Surface (PES) 
frequently used in Quantum Chemistry), and G, g and denote, respectively, the 
whole-space mass-metric tensor, the reduced mass-metric tensor in E and the Hessian of 
the constraining part of the potential. 

The different terms that correct the PES Vz in dU are regarded (and denoted) as en- 
tropies because they are linear in the temperature T and come from the averaging out of 
certain degrees of freedom (sometimes coordinates, sometimes momenta). Accordingly, 
the effective potentials occurring in the exponent of the equilibrium probabilities are 
regarded (and denoted) as free energies. 



Now, if Monte Carlo simulations in the coordinate space are to be performed [1241, |25D 
and the probability densities that correspond to any of these two models sampled, the 
correcting entropies in dU should be included or, otherwise, showed to be negligible. 

On the other hand, if rigid Molecular Dynamics simulations are performed with the 
intention of sampling from the stiff equilibrium probability [l2a 123, IZSD, then, the 



An automatic Perl script that generates the SASMIC Z-matrix, in the format of typical Quantum 
Chemistry packages, such as GAMESS or GaussianOS, from the sequence of amino acids, may be found 

at http : //neptuno . unizar . es/ files /public/ gen_sasmic/ 
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FIGURE 1. a) Model dipeptide HCO-L-Ala-NH2 numbered according to the SASMIC [17] scheme, 
b) Potential Energy Surface, c) Conformational dependence of the correcting terms. All energies are given 
in kcal/mol. 



so-called Fixman 's compensating potential [|29l1 , 
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(5) 



must be added to the PES V^. 

The conformational dependence of most of the determinants appearing in (Hj) and 
(|5]) is frequently assumed to be negligib le in the literature and they are consequently 
dropped from the calculations [3^, 31, 32, 33']. Also, subtly entangled to the assumptions 
underlying these simplifications, a second type of approximation is made that consists 
of assuming that the equilibrium values of the hard coordinates do not depend on the 
soft coordinates |[3ll [12,1331, [13] ■ This has been argued to be only approximate even in 
the case of classical force fields |35i 3^ 37]. 

In [l19], we have eliminated all simplifying assumptions and measured the conforma- 
tional dependence on the Ramachandran angles ^ and i// (the soft coordinates) of all cor- 
recting terms and of the Fixman's compensating potential in the model dipeptide HCO- 
L-Ala-NH2. The potential energy function used was the effective Born-Oppenheimer 
potential for the nuclei (see ([T])) derived from ab initio quantum mechanical calculations 
including electron correlation at the MP2/6-31-i-i-G(d,p) level of the theory. 

In table [TJ the main results of our work are presented. The importance of all the 
correcting terms is assessed by comparing (with the statistical distance d\2 described in 
the previous section) the effective potential Vi , containing the term, with the approximate 
one V2, lacking it. Moreover, if one assumes that the effective energies compared will 
be used to construct a polypeptide potential, the number A^res of residues up to which 
one may go keeping the distance between the two approximations of the the A^^-residue 
potential below RT is (see eq. (23) in [|141 ): 



TABLE 1. Quantitative assessment of the importance of the different 
correcting terms involved in the study of the constrained equihbrium 
of the protected dipeptide HCO-L-Ala-NH2 (see [19]). 



Corr.l yj. y£: flfi£ Nre^ ri£ 



-TS^-TSl F, Vz 0.74- RT 1.82 0.98 0.9967 
-TS", F, Vz-TS^ 0.14 RT 1.83 0.98 0.9967 
-TS\ F, Vz-TS", O.ll RT 80.45 1.00 0.9999 



-TS^ F, Vz 0.29 RT 11.62 1.01 0.9995 
Vf F, 0.67 RT 2.24 0.97 0.9972 



* Correcting term whose importance is measured in the corresponding row 

^ 'Correct' potential energy; the one containing the correcting term 

** 'Approximate' potential energy; the one lacking the correcting term 

Statistical distance between V\ and V2 (see [14 1) 

^ Number of residues in a polypeptide potential up to which the correcting 
term may be omitted 

Slope of the linear rescaling between Vi and V2 

II Pearson's correlation coefficient 



In the table, one can see that, in the stiff model, the Hessian-related correcting term 
should be included in Monte Carlo simulations for peptides as short as two residues, 
while the one that depends on G may be neglected up to chains which are ~ 80 residues 
long. The only correcting term occurring in the rigid model, in turn, may be dropped up 
to ~ 12 residues. Finally, the Fixman potential, containing all determinants, should be 
included in MD rigid simulations of peptides with more than two residue^. 

These results are related to a working set of conformations consisting of 144 points 
regularly distributed in the whole Ramanchandran space. In a second part of the work, 
we have repeated all the comparisons for a working set consisting of six secondary 
structure elements. The results suggest that, if one is interested only in this energetically 
lower region, the distances <ii2 are roughly divided by two and, accordingly, the values 
of A^res are four times larger. 

We have also repeated the calculations, with the same basis set (6-31++G(d,p)) and at 
the Hartree-Fock level of the theory in order to investigate if this less demanding method 
without electron correlation may be used in further studies. We have found that, indeed, 
this can be done, obtaining very similar results at a tenth of the computational effort. 

As far as we are aware, this is the first time that this type of study is performed in a 
relevant biomolecule with a realistic potential energy function. 




(6) 



° One should note that the distance between the PES Vz at MP2/6-31++G(d,p) and the one computed at 
HF/6-31++G(d,p) is di2 — 1.2 RT . A value slightly larger but of the order of the ones obtained when the 
most important correcting terms are dropped. 
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